Cross Validation




Overfitting

Choosing a Model

Recall: We are assuming that

output = f(input) + (noise)

and we would like to estimate f.

Here’s a suggestion:

\(\widehat{f}(x) := y\)

for every \((x, y)\) we observe



Then we win! \(\widehat{y}_i = y_i\) and we have an MSE of zero!
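This "memorizer" can be sketched in a couple of lines of base R (the data here are made up purely for illustration):

```r
# The "memorizer": for every observed (x, y), just predict the y we saw.
# Toy response values, made up for illustration only.
y     <- c(3.1, 4.7, 2.2, 5.9)
y_hat <- y                    # the memorizer returns each observed y exactly
mse   <- mean((y - y_hat)^2)  # training MSE is exactly zero
```

Zero training MSE looks like a win, but it says nothing about how this "model" behaves on data it has never seen.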

Overfitting

Recall from Assignment 1:

# A tibble: 431 × 6
     age sex      bmi smoker region    charges
   <dbl> <chr>  <dbl> <chr>  <chr>       <dbl>
 1    19 female  27.9 yes    southwest  16885.
 2    33 male    22.7 no     northwest  21984.
 3    32 male    28.9 no     northwest   3867.
 4    31 female  25.7 no     southeast   3757.
 5    60 female  25.8 no     northwest  28923.
 6    25 male    26.2 no     northeast   2721.
 7    62 female  26.3 yes    southeast  27809.
 8    56 female  39.8 no     southeast  11091.
 9    27 male    42.1 yes    southeast  39612.
10    23 male    23.8 no     northeast   2395.
# ℹ 421 more rows

Overfitting

More flexible models fit the training data better!

library(tidymodels)  # provides linear_reg(), set_mode(), set_engine(), %>%

lr_mod <- linear_reg() %>%
  set_mode("regression") %>%
  set_engine("lm")

bmi_poly_1 <- lr_mod %>%
  fit(charges ~ bmi, data = ins)


bmi_poly_20 <- lr_mod %>%
  fit(charges ~ poly(bmi, 20), data = ins)
ins <- ins %>%
  mutate(
    preds_1 = predict(bmi_poly_1, 
                      new_data = ins, 
                      type = "raw"),
    preds_20 = predict(bmi_poly_20, 
                       new_data = ins, 
                       type = "raw")
  )

Note that type = "raw" outputs the predictions as a vector rather than a data frame!

1st Order Polynomial

tidy(bmi_poly_1)
# A tibble: 2 × 5
  term        estimate std.error statistic  p.value
  <chr>          <dbl>     <dbl>     <dbl>    <dbl>
1 (Intercept)    1477.    2895.      0.510 0.610   
2 bmi             352.      92.3     3.81  0.000159
glance(bmi_poly_1)
# A tibble: 1 × 12
  r.squared adj.r.squared  sigma statistic  p.value    df logLik   AIC   BIC
      <dbl>         <dbl>  <dbl>     <dbl>    <dbl> <dbl>  <dbl> <dbl> <dbl>
1    0.0327        0.0305 11694.      14.5 0.000159     1 -4648. 9301. 9314.
# ℹ 3 more variables: deviance <dbl>, df.residual <int>, nobs <int>

20th Order Polynomial

tidy(bmi_poly_20)
# A tibble: 21 × 5
   term           estimate std.error statistic  p.value
   <chr>             <dbl>     <dbl>     <dbl>    <dbl>
 1 (Intercept)      12297.      560.    22.0   3.34e-71
 2 poly(bmi, 20)1   44565.    11628.     3.83  1.47e- 4
 3 poly(bmi, 20)2   -7613.    11628.    -0.655 5.13e- 1
 4 poly(bmi, 20)3    1448.    11628.     0.125 9.01e- 1
 5 poly(bmi, 20)4    7823.    11628.     0.673 5.01e- 1
 6 poly(bmi, 20)5    7346.    11628.     0.632 5.28e- 1
 7 poly(bmi, 20)6  -18996.    11628.    -1.63  1.03e- 1
 8 poly(bmi, 20)7  -18479.    11628.    -1.59  1.13e- 1
 9 poly(bmi, 20)8  -15574.    11628.    -1.34  1.81e- 1
10 poly(bmi, 20)9   -9785.    11628.    -0.841 4.01e- 1
# ℹ 11 more rows
glance(bmi_poly_20)
# A tibble: 1 × 12
  r.squared adj.r.squared  sigma statistic p.value    df logLik   AIC   BIC
      <dbl>         <dbl>  <dbl>     <dbl>   <dbl> <dbl>  <dbl> <dbl> <dbl>
1    0.0859        0.0414 11628.      1.93 0.00978    20 -4635. 9315. 9404.
# ℹ 3 more variables: deviance <dbl>, df.residual <int>, nobs <int>

Comparing RMSE Between Models

ins %>% 
  rmse(truth = charges, 
          estimate = preds_1)
# A tibble: 1 × 3
  .metric .estimator .estimate
  <chr>   <chr>          <dbl>
1 rmse    standard      11667.
ins %>% 
  rmse(truth = charges, 
          estimate = preds_20)
# A tibble: 1 × 3
  .metric .estimator .estimate
  <chr>   <chr>          <dbl>
1 rmse    standard      11341.



So, which model is better?

Does the 20th Order Seem Necessary?

Overfitting = “unnecessarily wiggly”

Bias and variance

bias = how much systematic prediction error the model makes because it is not flexible enough to capture the true pattern


variance = how much the model is fit to the particular data it was trained on, instead of being generalizable to any data
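A quick simulation in base R illustrates the variance side (the data and the degree-15 model here are made up for illustration, not from the assignment): refit a simple and a very flexible model on many fresh noisy samples and watch how much their predictions at a single point jump around.

```r
# Simulated sketch: a degree-1 fit vs. a degree-15 fit, each refit on 200
# fresh noisy samples from the same true curve. (All data made up.)
set.seed(1)
xs <- seq(0, 1, length.out = 50)
preds_at_half <- replicate(200, {
  y <- sin(2 * pi * xs) + rnorm(50, sd = 0.5)  # a new noisy sample
  fit_simple <- lm(y ~ xs)                     # high bias, low variance
  fit_wiggly <- lm(y ~ poly(xs, 15))           # low bias, high variance
  c(simple = unname(predict(fit_simple, data.frame(xs = 0.5))),
    wiggly = unname(predict(fit_wiggly, data.frame(xs = 0.5))))
})
apply(preds_at_half, 1, var)  # the flexible fit's predictions vary far more
```

The simple fit misses the true curve in the same way every time (bias); the wiggly fit chases the noise, so its prediction at any fixed point swings from sample to sample (variance).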


Solutions to Overfitting

Theoretical solutions to overfitting

One idea is to come up with a metric that penalizes complexity / flexibility in the model.


metric = (measure of fit) − (penalty based on the number of predictors)

Examples:

  1. Adjusted R-squared

  2. AIC (Akaike Information Criterion)

  3. BIC (Bayesian Information Criterion)

  4. Mallows' Cp
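Base R's AIC() and BIC() compute two of these penalized metrics directly from fitted lm objects. A sketch, assuming the ins data from earlier is loaded:

```r
# Sketch: comparing the two fits from earlier with penalized metrics.
# Assumes `ins` (the insurance data from Assignment 1) is already loaded.
fit_1  <- lm(charges ~ bmi, data = ins)
fit_20 <- lm(charges ~ poly(bmi, 20), data = ins)

AIC(fit_1, fit_20)  # lower is better; the penalty grows with each extra term
BIC(fit_1, fit_20)  # BIC penalizes complexity even more heavily than AIC
```

These match the AIC/BIC columns of the glance() output above, where the 20th-order model loses (AIC 9315 vs. 9301) despite its higher R-squared.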

Theoretical solutions to overfitting

Pros:

  • Easy to compare models quickly: only one number to compute per model.

  • Each metric has some mathematical justification.

Cons:

  • Which one is most justified?

  • What if they don't agree? (This is common!)


Training and test splits

Training and test data


What if we randomly set aside 10% of our data to be our test data?


We train the model using the remaining 90% of the data.


Then we check the prediction accuracy on the test data, which the model could not possibly have overfit!

Training and test data

# Set seed, so our "randomness" is consistent
set.seed(190498)

# Specifying the proportion of the data to be retained for analysis (training)
ins_split <- ins %>% initial_split(prop = 0.90)
ins_test <- ins_split %>% testing()
ins_train <- ins_split %>% training()
dim(ins)
[1] 431   8
dim(ins_test)
[1] 44  8
dim(ins_train)
[1] 387   8

Training and test data

Fit the models on the training data only


bmi_poly_1 <- lr_mod %>%
  fit(charges ~ bmi, data = ins_train)

bmi_poly_20 <- lr_mod %>%
  fit(charges ~ poly(bmi, 20), data = ins_train)

Training and test data

Find model predictions on the test data only

ins_test <- ins_test %>%
  mutate(
    preds_1 = predict(bmi_poly_1, 
                      new_data = ins_test, 
                      type = "raw"),
    preds_20 = predict(bmi_poly_20, 
                       new_data = ins_test, 
                       type = "raw")
  )

Training and test data

Check model metrics on the test data only

ins_test %>% 
  rmse(truth = charges, 
          estimate = preds_1)
# A tibble: 1 × 3
  .metric .estimator .estimate
  <chr>   <chr>          <dbl>
1 rmse    standard      10658.


ins_test %>% 
  rmse(truth = charges, 
          estimate = preds_20)
# A tibble: 1 × 3
  .metric .estimator .estimate
  <chr>   <chr>          <dbl>
1 rmse    standard      10537.

Your turn!

Cross-Validation

Cross-Validation

If the test/training split helps us measure model success…

… but it’s random, so it’s not the same every time…

… why not do it a bunch of times?

k-fold Cross-Validation
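The pattern can be written out by hand in base R: split the rows into k folds, hold each fold out in turn as test data, and average the test RMSEs. (tidymodels' vfold_cv() and fit_resamples() automate exactly this; the helper below is a hypothetical illustration, not their API.)

```r
# Hand-rolled k-fold CV sketch in base R (tidymodels automates this pattern).
kfold_rmse <- function(formula, data, k = 10) {
  n        <- nrow(data)
  fold     <- sample(rep(1:k, length.out = n))  # randomly assign rows to folds
  response <- all.vars(formula)[1]              # name of the outcome variable
  rmses <- sapply(1:k, function(i) {
    fit   <- lm(formula, data = data[fold != i, ])      # train on k-1 folds
    preds <- predict(fit, newdata = data[fold == i, ])  # predict held-out fold
    sqrt(mean((data[[response]][fold == i] - preds)^2)) # test RMSE for fold i
  })
  mean(rmses)  # average test RMSE across all k folds
}

# e.g. kfold_rmse(charges ~ bmi, ins) vs. kfold_rmse(charges ~ poly(bmi, 20), ins)
```

Because every row gets used as test data exactly once, the averaged RMSE is far less sensitive to one lucky (or unlucky) random split than a single test/training split.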

Your turn!